Part 1: Automobile Domain - Train Regression Models to Predict mpg

Some values in the "hp" column do not appear to be numbers. Let us check what they are.

They turn out to be question marks, so let us replace them with NaN entries.
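
The check and the replacement can be sketched as follows. This is a minimal sketch on a toy frame, assuming the table is a pandas DataFrame (named `cars` here, a hypothetical name) whose raw "hp" column was read as strings; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real table; the values are illustrative only.
cars = pd.DataFrame({
    "mpg": [18.0, 25.0, 21.0, 40.9],
    "hp": ["130", "?", "95", "?"],
})

# Which entries of "hp" cannot be parsed as numbers?
is_bad = pd.to_numeric(cars["hp"], errors="coerce").isna()
bad_values = sorted(cars.loc[is_bad, "hp"].unique())

# Replace the question marks with NaN and convert the column to float.
cars["hp"] = pd.to_numeric(cars["hp"].replace("?", np.nan))
n_missing = int(cars["hp"].isna().sum())
```

Here `bad_values` lists the offending strings and `n_missing` counts the NaN entries left behind after the conversion.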

There are 6 NaN entries in total in the table. They can be replaced with something representative, so let us replace them with the column median.
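
A minimal sketch of the median replacement, assuming the "hp" column has already been converted to floats with NaN for the missing entries. The frame name `cars` and the values are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries; the values are illustrative only.
cars = pd.DataFrame({"hp": [130.0, np.nan, 95.0, 150.0, np.nan, 88.0]})

# Series.median() skips NaN by default, so it uses only the known values.
hp_median = cars["hp"].median()

# Fill every missing entry with that median.
cars["hp"] = cars["hp"].fillna(hp_median)
remaining_nan = int(cars["hp"].isna().sum())
```

The median is used rather than the mean because it is less sensitive to the outliers that horsepower tends to have.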

The target column itself appears to have two peaks, which suggests that the data may contain two clusters.

Observations:

  1. An almost equal number of vehicles is manufactured in every year. If we assume the general principle that new technology prompts improvements and an increase in sales, then such uniform and smooth production makes sudden dramatic changes in the models or breakthroughs in vehicle technology less likely.
  2. An unusually large share of the cars comes from a single region of origin. That might mean that the market leaders are from that particular country. How this relates to everything else remains to be seen.
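
The two counts behind these observations could be produced along the following lines. This is only a sketch on toy data; the frame name `cars` and the integer coding of "origin" are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for the real table; the values are illustrative only.
cars = pd.DataFrame({
    "year": [70, 70, 71, 71, 72, 72, 73, 73, 74, 74],
    "origin": [1, 1, 1, 2, 1, 3, 1, 1, 2, 1],
})

# Number of vehicles per model year, in chronological order.
per_year = cars["year"].value_counts().sort_index()

# Share of vehicles per origin, largest first.
origin_share = cars["origin"].value_counts(normalize=True)
```

A flat `per_year` corresponds to the first observation, and a single dominant entry in `origin_share` to the second.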

Observations:

  1. Displacement and horsepower most likely each contain two or three clusters or groups.
  2. Acceleration is very smoothly distributed, so it probably has nothing to do with the actual clustering of the data, or with any other variable.
  3. Horsepower has some outliers, which are most likely wrong entries, because there are presumably strict limits on how much horsepower an engine can provide. A strong contributing factor is weight. Power relates the amount of fuel consumed to the amount of acceleration the vehicle gets, so the relationship between these quantities needs to be explored.
  4. The target column, miles per gallon, is itself smoothly distributed, indicating that clusters might not be present at all, or would be hard to find.

Weight seems to be the main contributing factor behind many things here (especially the target column), as it has significant covariance with the other quantities in the dataset. Further analysis is needed, however.

Note: Acceleration does appear to have a weak relationship with other variables in the table, especially displacement and horsepower.

From all of the above it is fairly clear that some columns, such as acceleration, have no direct bearing on the mileage (mpg). "year" and "origin" also seem to play no significant role in determining how much mileage is expected of a given vehicle, so they can be ignored.

Four important columns, "wt", "disp", "cyl" and "hp", seem to have a good correlation with the "mpg" column.
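
One way to rank columns by their correlation with "mpg" is sketched below. It uses toy values and only a subset of the columns; the frame name `cars` is an assumption, and Pearson correlation is assumed because it is the pandas default.

```python
import pandas as pd

# Toy stand-in for the real table; the values are illustrative only.
cars = pd.DataFrame({
    "mpg": [30, 25, 20, 15, 10],
    "wt": [2000, 2500, 3000, 3500, 4000],
    "hp": [70, 90, 100, 140, 160],
    "acc": [15, 14, 16, 15, 15],
})

# Pearson correlation of every column with the target.
corr_with_mpg = cars.corr()["mpg"].drop("mpg")

# Rank by strength, ignoring the sign.
ranked = corr_with_mpg.abs().sort_values(ascending=False)
```

Sorting by absolute value matters here because the strongest relationships with mileage are negative ones.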

Observations: Mileage seems to have a really good relationship with the number of cylinders. There is very little variance within each cylinder count, and one count sits at the peak of the regression plot, which suggests it is the number of cylinders that yields the most mileage. That is an interesting observation. Mileage also seems to increase as the cars get newer, but the trend is very weak and cannot be asserted yet. On the other hand, mileage seems to be strongly correlated with the origin of the car. Is that truly so?

Horsepower, though strongly correlated with mileage, is widely distributed, which leaves us wondering whether the relationship can be clearly defined. Compared with the other variables, it is probably reasonable either to replace its outliers or to leave them as they are, depending on how many there are.

Looking at the above, 4 seems to be the most appropriate number of clusters to generate here. The highest correlations are observed with the disp and wt columns, which show 4 and 5 distinct groups in their own distributions respectively. Therefore, I opt for 4 as the optimal choice.

Before we proceed, we need to process the data, scale it appropriately and then fit it to the model.
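
The scaling step could look like the sketch below. It assumes z-score standardization, which is one common choice before distance-based clustering; the notebook's actual scaler is not stated here, and the frame name `features` and its values are illustrative.

```python
import pandas as pd

# Toy feature table; the values are illustrative only.
features = pd.DataFrame({
    "wt": [2000.0, 2500.0, 3000.0, 3500.0, 4000.0],
    "disp": [90.0, 120.0, 200.0, 300.0, 400.0],
})

# z-score: subtract the column mean and divide by the population
# standard deviation (ddof=0), so every column has mean 0 and std 1.
scaled = (features - features.mean()) / features.std(ddof=0)
```

Without this step a column measured in thousands, such as weight, would dominate the distances between points.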

Looking at the above picture, 4 clusters seem apt for this purpose. As discussed earlier, we will proceed with 4.
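
A picture for choosing the number of clusters is often an elbow plot of the K-Means inertia. The sketch below computes those inertias with scikit-learn on synthetic data with four well-separated groups; the data, the range of `k` and the `random_state` are assumptions for illustration, not the notebook's actual inputs.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four well-separated groups of 30 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=1.0, size=(30, 2)) for c in centers])

# Within-cluster sum of squares (inertia) for k = 1..6.
inertias = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_
```

Plotting `inertias` against `k` shows a bend at the value where adding more clusters stops helping much, which on this toy data is 4.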

K-Means Clustering

Observations: The clusters do not seem to be far apart from each other, given how little variation there is within the data, especially in the distributions.

Observations:

In most cases the 3rd group seems to hold the largest number of outliers for almost all of the features, which is very interesting. It might indicate that a somewhat different model or relationship between data points holds in that particular cluster. We might want to use a different model for training, or perhaps even reduce the number of clusters.
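
Counting outliers per cluster can be sketched as below. It assumes the usual box-plot rule (points more than 1.5 IQR beyond the quartiles), a cluster label stored in a hypothetical "cluster" column, and toy values.

```python
import pandas as pd

def count_iqr_outliers(values: pd.Series) -> int:
    """Count points lying more than 1.5 * IQR outside the quartiles."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    mask = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)
    return int(mask.sum())

# Toy data: two clusters, the second with one extreme horsepower value.
cars = pd.DataFrame({
    "cluster": [0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    "hp": [90, 92, 95, 97, 100, 150, 152, 155, 158, 230],
})

# Outlier count of "hp" within each cluster.
outliers = cars.groupby("cluster")["hp"].apply(count_iqr_outliers)
```

Repeating this for every feature gives a table that shows which cluster collects most of the outliers.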

Notice that the two methods produce a radically different number of clusters for the same number of elements. This is really surprising. Why did it happen?

I think the likely reason for the difference is that the data contains some irrelevant columns. They seem to carry much more weight in the dendrogram than in K-Means clustering. They influence K-Means too, but the dendrogram additionally depends on how the inter-cluster distance is measured, that is, on the linkage method we choose, which can exaggerate differences that K-Means does not show. Perhaps if we change the linkage method from complete to average, we will get results that are closer to identical?
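
The linkage question can be tested directly by cutting the tree into the same number of clusters under both linkages and comparing the cluster sizes. The sketch below does that with SciPy on synthetic data; the data and the cut at 4 clusters are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Synthetic data: four well-separated groups of 20 points each.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])

sizes = {}
for method in ("complete", "average"):
    # Build the hierarchy, then cut it into at most 4 flat clusters.
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=4, criterion="maxclust")
    # fcluster labels start at 1, so index 0 of bincount is dropped.
    sizes[method] = sorted(np.bincount(labels)[1:].tolist())
```

On well-separated toy data both linkages agree. Running the same comparison on the scaled vehicle data would show whether the choice of linkage accounts for the discrepancy.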

It is fairly clear that the data contains irrelevant features. The main features that can be eliminated are acc and wt.

Note: The above shows how models are fit to the clusters, and what the regression coefficient of each feature is within each cluster.
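
A per-cluster fit of this kind can be sketched as follows. It assumes a linear regression from scikit-learn, a single feature "wt", and a hypothetical "cluster" column holding the labels; the toy values are constructed so that each cluster follows its own exact line.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data: each cluster follows its own linear mpg-vs-weight relation.
wt = np.array([2000.0, 2200.0, 2400.0, 2600.0, 3500.0, 3800.0, 4100.0, 4400.0])
cluster = np.array([0, 0, 0, 0, 1, 1, 1, 1])
mpg = np.where(cluster == 0, 50 - 0.010 * wt, 30 - 0.004 * wt)
cars = pd.DataFrame({"wt": wt, "mpg": mpg, "cluster": cluster})

feature_names = ["wt"]
results = {}
for label, group in cars.groupby("cluster"):
    # Fit a separate model on the rows of this cluster only.
    model = LinearRegression().fit(group[feature_names], group["mpg"])
    results[label] = {
        "coef": dict(zip(feature_names, model.coef_)),
        "r2": model.score(group[feature_names], group["mpg"]),
    }
```

`results` then holds, for every cluster, the coefficient of each feature and the R² score of the model on that cluster.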

Please note: Fitting different models to different clusters is tried here.

Observations and Notes

• The weight of the vehicle is negatively correlated with mileage. That means vehicle manufacturers should reduce the weight of the vehicle. So as not to compromise on stability, instead of heavy metals they might use lightweight alloys that do more or less the same job.
• The optimal number of cylinders seems to be 4, since those vehicles have a better minimum and mean mileage than the others.
• Horsepower also has a negative correlation with mileage. Horsepower indicates the engine power. Looking closely, there is a clear positive correlation between horsepower and weight, clustered around the minimum point.
• Mileage seems to be increasing as time passes, but not at a desirable rate.
• To get the maximum mileage, the company should concentrate on 4-cylinder machines that are lightweight and have higher horsepower.
• The origin of the car seems to be much less important than the data suggests. The mileage of a car depends on how well it is maintained, on the profile of the car, including its target use, on the places it is driven in, and so on.
• So, it would be better if the company separated its lines of production, that is, differentiated its targets while the product is being manufactured or already at the planning stage, so that different characteristics are enhanced for different purposes. For example, smaller or mileage-oriented vehicles designed with lightweight metals and a capped top speed will increase mileage. Those models should be kept separate from the high-acceleration vehicles designed for faster freeways, whose focus is not on mileage.
• If the company wants to improve its sales, it should concentrate on generating more data about the outside shape, that is, aerodynamic characteristics, ground clearance, axle configurations, and the wheels and tyres used, because these are important factors in determining how much time a vehicle spends at any particular speed, and thus in the mileage and in how well it can be predicted.
• While collecting data, the company should concentrate more on how the average mileage changes with the age of the vehicle. If a particular vehicle performs well despite its age, perhaps some special feature of the vehicle enhances its longevity and the consistency of its performance. Such readings would definitely help.
• In addition, a general survey of the areas where the vehicles were sold (i.e. pincodes) and of the condition of the roads in those areas might help a lot in diagnosing which feature should be enhanced for which kind of area to give the maximum yield.
• Different models for different clusters is a bit too much to ask of such a small data set. However, we can clearly see that some models fit certain clusters with a better score than others. That shows there is scope for optimization in the future.